Reconciliation plan#13748
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
a057fa0 to
46e274e
Compare
|
/review |
There was a problem hiding this comment.
Pull request overview
Introduces a new “reconciliation plan” pipeline for docker compose create by replacing the legacy convergence flow with: observed state collection → plan generation (DAG) → parallel plan execution.
Changes:
- Added new reconciliation components:
ObservedState,Plan,reconcilelogic, and a parallelexecutePlanexecutor with grouped progress events. - Refactored
create()to use the new pipeline (including “Running” events for unchanged containers) and removed most of the oldconvergenceimplementation. - Updated
runto resolveservice:references withoutconvergence, and added/updated unit tests for the new modules.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| reconciliation.md | Architecture/design doc for the new reconciliation engine and pipeline. |
| pkg/compose/run.go | Replaces convergence-based service-reference resolution for run with a new helper. |
| pkg/compose/reconcile_test.go | Adds unit tests validating plan output for networks/volumes/containers/orphans. |
| pkg/compose/reconcile.go | Implements reconciler producing a deterministic DAG plan for infra + containers. |
| pkg/compose/progress.go | Removes runningEvent helper (replaced by direct newEvent usage). |
| pkg/compose/plan_test.go | Adds unit tests for Plan and OperationType stringification. |
| pkg/compose/plan.go | Introduces Plan, PlanNode, Operation, and OperationType abstractions. |
| pkg/compose/observed_state_test.go | Adds tests for observed-state snapshot construction and classification. |
| pkg/compose/observed_state.go | Adds snapshot types + collection logic; adds “Running” events emission helper. |
| pkg/compose/executor_test.go | Adds executor tests using mocked API client calls. |
| pkg/compose/executor.go | Implements parallel plan execution, operation handlers, and grouped progress events. |
| pkg/compose/create.go | Refactors create() to build observed state → reconcile → emit running events → execute plan. |
| pkg/compose/convergence.go | Removes convergence struct/logic; keeps/exports service-reference resolution helpers. |
| pkg/compose/containers.go | Adds getContainersByService() and removes Containers.names(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ensureImagesExists -> ensureNetworks/Volumes -> collectObservedState -> reconcile -> executePlan | ||
| (inchange) (inchange) (1) (2) (3) | ||
| ``` |
There was a problem hiding this comment.
Spelling: "inchange" should be "inchangé".
ff0c5cd to
136845b
Compare
|
/review |
|
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs. |
8f2d032 to
b812190
Compare
|
/review |
|
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs. |
b812190 to
f418dc4
Compare
f418dc4 to
b812190
Compare
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
String() is designed to make it easy to compare coomputed plan vs expectations Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
- Remove the `convergence` struct and `newConvergence` constructor - Extract `resolveServiceReferences` as a standalone function taking `map[string]Containers` instead of a method on convergence - Add `getContainersByService` helper on composeService - Update run.go and executor.go to use the new standalone function - Remove dead code: `getObservedState`, `setObservedState` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
…remove - Fix ContainerReplaceLabel detection: use op.Inherited != nil (not op.Container) as signal for recreate in execCreateContainer - Use observed network name (not desired) for DisconnectNetwork and RemoveNetwork operations, in case the name changed - Use observed volume name (not desired) for RemoveVolume operations - Update reconciliation.md with 3 new lessons learned (7.8, 7.9, 7.10) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
b812190 to
08da1d5
Compare
08da1d5 to
aaa34eb
Compare
- plan.go: pin OperationType constants to explicit values so adding an op in the middle doesn't shift the others. - executor.go: remove the meaningless `var _ = getContainerProgressName` line — same-package functions are always accessible. - reconcile.go: fix the stale switch-default comment that contradicted the case clause above it. - reconcile.go: drop the local `serviceLabel` const that shadowed `api.ServiceLabel` and use the shared constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
ensureProjectVolumes already prompts the user when a volume's config hash diverges from the compose file (create.go:1626) and recreates it on confirm. The reconciler ran after ensureProjectVolumes and prompted again with the exact same message — so a user who declined the first prompt was asked the same question a second time. Drop the prompt + recreate call from reconcileVolumes(). Recreation of diverged volumes stays owned by ensureProjectVolumes; the reconciler only plans the creation of missing volumes. If the user declined recreation, the existing container's mounts still match the existing volume name and hasVolumeMismatch correctly returns false, so containers are not falsely flagged as obsolete. Keep the supporting infrastructure available for future use, when divergence detection migrates fully into the reconciler: - reconciler.prompt field - prompt parameter on reconcile() - planRecreateVolume function (//nolint:unused) - servicesUsingVolume function (//nolint:unused) - noPrompt test helper The reframed test (TestReconcileVolumes_DivergedIsIgnored) asserts the new contract: a diverged volume produces no plan operations from the reconciler. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Two related changes to the executor, plus the small cleanups they attracted in review: * Rename node consults a planner-set CreateNodeID instead of walking ancestors. The old execRenameContainer searched node.DependsOn for a CreateContainer result and fell back to a recursive walk through the group chain. That worked only by convention; a future op with cross- node data needs would have to rediscover or copy the pattern. Now the rename op carries an explicit CreateNodeID int set at plan time and the executor reads pctx[CreateNodeID].ContainerID directly. The recursive findCreatedIDInChain is gone. * Stop re-listing containers on every create. execCreateContainer used to call getContainersByService(ctx, projectName) — a fresh ContainerList per create — to resolve service references at execute time. The executor now holds a live containersByService view seeded from ObservedState (via observed.containersByService()) and grown as OpCreateContainer nodes complete, so service references resolve from memory. On OpRemoveContainer the removed container is dropped from the view via slices.DeleteFunc, so a dependent's create that resolves network_mode: service:x against the just-removed container cannot pick up a stale ID (Containers.sorted() orders by canonical name and would otherwise return the removed container). * Defensive slices.Clone of op.Service.VolumesFrom in execCreateContainer. resolveServiceReferences mutates VolumesFrom in place, and the shallow struct copy of *op.Service still shares the backing array. Single-execution-per-node makes it safe today, but the clone removes the trap for any future parallel-execution mode. * Operation gains a CreateNodeID int (not a *PlanNode pointer) to avoid a structural cycle between Operation and PlanNode. OperationType values are pinned to explicit integers so adding an op in the middle cannot silently shift the others. * execRenameContainer carries two checks so a missing CreateNodeID and an empty produced ID are distinguishable in logs. Both are programmer invariants (prefixed "internal:"). * containersByServiceFromObserved moved from a package-level helper to a method on *ObservedState. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Three improvements identified in the Principal Engineer pass but deliberately deferred: 1. Test fidelity. Split executePlan into newPlanExecutor (constructs the executor seeded from observed state) and (*planExecutor).run (walks the DAG). Production callers go through executePlan unchanged. TestExecutePlanRemoveContainerDropsFromCache now uses newPlanExecutor + run, exercising the same errgroup, done-channel and group-tracker wiring as production instead of a hand-rolled loop over executeNode. 2. //nolint:unused chain. The three preserved helpers (reconciler.prompt, planRecreateVolume, servicesUsingVolume) each carried a separate "kept for future" comment. Consolidate the rationale on the reconciler.prompt field doc and point the helper nolint directives there, so a future cleanup is a single grep. 3. Concurrency test. Add TestExecutePlanConcurrentRemovesCacheCoherence which builds N independent Stop→Remove chains in one plan; the errgroup fans them out across goroutines that all hit containersByService under the mutex. Passes under -race. Failure would expose a missing or incorrect lock. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
executor.go had grown to ~480 lines mixing three concerns: - DAG orchestration (run, executeNode, planExecutor struct) - per-OpType implementations (execCreateNetwork, …, execRenameContainer) - event tracking (groupTracker + per-OpType emit helpers) Split into three files of ~170 lines each, one concern per file: - executor.go — planExecutor, reconciliationContext, run, executeNode dispatch - executor_ops.go — all execXxx methods - executor_events.go — groupTracker + emitStartEvent / emitDoneEvent / emitErrorEvent Pure refactor: no functional change. Tests (incl. -race) and lint both pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
aaa34eb to
3900c50
Compare
Three fixes surfaced by an external code review of the new reconciler: 1. Scale-down now propagates through serviceNodes. When a service is scaled down (all containers in excess), reconcileService used to continue without assigning lastNode, leaving r.serviceNodes[svc] unset. Dependent services then declared no edge on the scale-down ops and could start before the cleanup finished. Track the RemoveContainer node as lastNode so depends_on chains pick it up. 2. mustRecreate errors are no longer silently ignored by sortContainers. The comparator used `obsi, _ := r.mustRecreate(...)`, falling back to false on any hashing error. Pre-compute obsolescence into a map keyed by container ID before sorting and propagate the error to reconcileService. 3. A container is no longer Stopped twice when its network and its config both diverge. planRecreateNetwork already stops the affected container as part of the disconnect/remove/recreate dance; the subsequent planRecreateContainer (triggered via hasNetworkMismatch) used to add another OpStopContainer against the now-stopped target. Track stops in r.stoppedByPlan; planRecreateContainer reuses an existing Stop node when present, and chains its Remove on both that Stop and the replacement Create. Two golden tests (TestReconcileNetworks_Diverged*) are updated to reflect the new, dedupe'd plan shape (one Stop instead of two per recreated container). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
Two coverage gaps surfaced by an external review: - TestReconcileContainers_DependsOnChain: asserts that a service B with depends_on: [A] produces a CreateContainer for B that depends on A's last plan node (the serviceNodes mechanism in infrastructureDeps). This was the only depends_on-via-plan-DAG behavior untested before. - TestReconcileContainers_DependsOnScaleDown: companion test that exercises the scale-down → dependent path specifically, verifying that the previous commit's lastNode-on-scale-down fix actually wires the dependency through. - TestOperationTypeString: adds OpRunProvider to the table; all other OperationType values were already covered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
f5336ce to
7bbbc77
Compare
External reviewer noted that the alreadyStopped branch adds createNode to removeDeps while the !alreadyStopped branch does not — semantically correct but fragile, since it relies on the implicit invariant that stopNode.DependsOn contains createNode in the !alreadyStopped path. Spell out the invariant in a comment so a future maintainer who edits the stop → create edge in the normal path knows they must also add createNode unconditionally in the remove deps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Guillaume Lours <glours@users.noreply.github.com>
this is an alternative for #13641
based on previous work, I first generated an analysis an plan document, so AI work on this significant refactoring is better guided and impact is well understood